# HALO

## Installation
Create a new Python environment (tested with `python=3.10`). The commands below target CUDA 12.4; older CUDA versions should also work with some modifications.
Then run the following commands in order:
```
source install.sh
cd gemm-int8 && export GEMM_INT8_PATH=$(pwd) && cd ..
cd gemm-fp8 && export GEMM_FP8_PATH=$(pwd) && cd ..
```
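
Because the exports above only last for the current shell session, a quick sanity check before training or benchmarking can catch an unset or stale path. The following is a minimal sketch; the function name `check_gemm_paths` is hypothetical and not part of this repository.

```bash
# Hypothetical helper (not part of this repository): verify that both
# GEMM path variables are set and point to existing directories.
check_gemm_paths() {
    local status=0
    local var value
    for var in GEMM_INT8_PATH GEMM_FP8_PATH; do
        # Look up the value of the variable whose name is stored in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "$var is not set" >&2
            status=1
        elif [ ! -d "$value" ]; then
            echo "$var does not point to a directory: $value" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_gemm_paths && echo "GEMM paths OK"` prints the confirmation only when both variables are usable.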

## Training
First, make sure the environment variables `GEMM_INT8_PATH` and `GEMM_FP8_PATH` point to the `gemm-int8` and `gemm-fp8` directories, as set in the Installation section above. To fine-tune a Llama-3-8B model, run:
```
cd scripts
CUDA_VISIBLE_DEVICES=0,1,2,3 bash train_halo.sh DATASET=<dataset> LR=<lr> KERNEL_TYPE=<kernel_type>
```

For `<dataset>` and `<lr>`, you can try the following combinations: (sql, 3e-5), (viggo, 4e-5), (gsm8k, 6e-6). For `<kernel_type>`, choose one of the following:
- `base`: runs the baseline BF16 experiment, with HALO disabled.
- `halo0_fp8`: runs HALO level 0 with FP8 precision.
- `halo2_int8`: runs HALO level 2 with INT8 precision.

You can add `_qfsdp` to enable HQ-FSDP, for example: `halo0_fp8_qfsdp`. Other combinations of precision and HALO levels also work, e.g., `halo1_int8_qfsdp`.
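
The suggested (dataset, learning rate) pairs can be swept with a small loop. This is a sketch, assuming it is run from the `scripts` directory and that `train_halo.sh` accepts its arguments exactly as in the command above. The function `run_sweep` and the `DRY_RUN` variable are hypothetical additions; by default the commands are only printed, not executed.

```bash
# Hypothetical sweep helper (not part of this repository).
# Prints one training command per suggested (dataset, lr) pair;
# set DRY_RUN=0 to execute the commands instead of printing them.
run_sweep() {
    local kernel_type="${1:-halo0_fp8}"
    local pair dataset lr
    for pair in sql:3e-5 viggo:4e-5 gsm8k:6e-6; do
        dataset="${pair%%:*}"   # text before the colon
        lr="${pair##*:}"        # text after the colon
        # Rebuild the positional parameters as the command to run.
        set -- bash train_halo.sh "DATASET=$dataset" "LR=$lr" "KERNEL_TYPE=$kernel_type"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "CUDA_VISIBLE_DEVICES=0,1,2,3 $*"
        else
            CUDA_VISIBLE_DEVICES=0,1,2,3 "$@"
        fi
    done
}
```

Calling `run_sweep halo2_int8_qfsdp` lists the three commands for that kernel type; `DRY_RUN=0 run_sweep halo2_int8_qfsdp` runs them one after another.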


## Benchmarks
First, make sure the environment variables `GEMM_INT8_PATH` and `GEMM_FP8_PATH` point to the `gemm-int8` and `gemm-fp8` directories, as set in the Installation section above. The benchmark files are located in the `tests` directory:
```
cd tests
```

### Linear Module
You can run the single-layer benchmarks using the following command:
```
CUDA_VISIBLE_DEVICES=0 python linear_module_benchmark.py --kernels base jetfire switchback halo2_int8 halo1_fp8 halo0_fp8
```

### Per-Block Benchmarks
To run the single-GPU block-level benchmarks, run:
```
CUDA_VISIBLE_DEVICES=0 python benchmark_llama3_halo.py --num_blocks 3 --kernels base haloi_int8 haloi_fp8 halo0_fp8 halo1_fp8 halo2_int8
```
Here `haloi` corresponds to the Ideal kernels in the paper.

For multi-GPU INT8 benchmarks, run:
```
NCCL_NTHREADS=64 CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nnodes=1 --nproc-per-node=4 benchmark_llama3_halo.py --fsdp --num_blocks 3 --kernels base haloi_int8 haloi_int8_qfsdp halo2_int8 halo2_int8_qfsdp
```
and for FP8:
```
NCCL_NTHREADS=64 CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nnodes=1 --nproc-per-node=4 benchmark_llama3_halo.py --fsdp --num_blocks 3 --kernels base haloi_fp8 haloi_fp8_qfsdp halo0_fp8 halo0_fp8_qfsdp halo1_fp8 halo1_fp8_qfsdp
```
Note that `NCCL_NTHREADS=64` is tuned for the RTX 4090. On newer GPUs, you can omit it and use the NCCL default.
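
If the same launch script is shared across machines, the thread setting can be chosen from the detected GPU name. This is only a sketch: the helper `nccl_nthreads_for` is hypothetical, and matching on the substring `RTX 4090` in the `nvidia-smi` output is an assumption. For any other GPU name it leaves `NCCL_NTHREADS` unset, which follows the note above for newer GPUs but has not been checked for older ones.

```bash
# Hypothetical helper (not part of this repository): map a GPU name
# to the NCCL_NTHREADS value to use, or print nothing for the default.
nccl_nthreads_for() {
    case "$1" in
        *"RTX 4090"*) echo 64 ;;  # value this README reports as tuned for the RTX 4090
        *) ;;                     # print nothing: keep the NCCL default
    esac
}

# Read the name of the first visible GPU (empty if nvidia-smi is unavailable).
gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -n 1)
threads=$(nccl_nthreads_for "$gpu_name")
if [ -n "$threads" ]; then
    export NCCL_NTHREADS="$threads"
else
    unset NCCL_NTHREADS
fi
```

With the variable exported (or unset) this way, the `torchrun` commands above can be launched without the `NCCL_NTHREADS=64` prefix.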
